STAT 331 Final Project
Description, Hypothesis, & Explanation of Methods
Data and Variable Descriptions
Our project uses two datasets: lex.csv and hapiscore_whr.csv. In the lex dataset, the “country” variable lists 195 countries, and each remaining column represents a year from 1800 to 2100. Each value is the life expectancy at birth for that country and year: the number of years a newborn would be expected to live if age-specific mortality rates stayed constant throughout their life. The data from 1800 to 1970 comes from version 7 by Mattias Lindgren, compiled from 100 sources. For the period 1970 to 2016, the main source is the Institute for Health Metrics and Evaluation (IHME) and its Global Burden of Disease Study. The IHME data from 1970 to 2017 was published in September 2017, and data from 2017 to 2100 was sourced from the United Nations’ World Population Prospects 2019.
In the happiness score dataset, the “country” column lists 163 countries, and each year from 2005 to 2022 has a corresponding column. The dataset reports the “Happiness score” or “Cantril life ladder”, which is the national average response to a life evaluation question. The scores are rescaled from 0-10 to 0-100 for easier communication. The data is sourced from the World Happiness Report and the Gallup World Poll surveys, which are conducted in multiple countries and languages.
Hypothesis About Variables
We hypothesize that life expectancy increases with happiness score. This is largely an intuitive belief: we would expect people in countries with higher average happiness to live longer. Of course, the relationship may vary from country to country. It is possible that more developed countries have both higher happiness scores and higher life expectancy, while less developed countries have lower values of both.
Data Cleaning Process
Data cleaning and analysis were performed using the “tidyverse” package in R, involving column selection, pivoting to long format, and removal of null values. We decided on a range from 2006 to 2021, since the happiness score dataset had the more limited date range. In the cleaned lex dataset, there were 27 missing values, all in the ‘life expectancy’ variable: 9 countries had 3 missing values each, in the years 2020, 2021, and 2022. Since only 9 countries had a few missing values each, we chose to keep the dates up to 2021.
The cleaned happiness dataset had more missing values, with 732 in total. A good portion of them (136) were from 2005, with the rest spread seemingly at random across the other years of the study, so we concluded that no hidden variables were driving where missing values occurred. As a result of this missing value analysis, we chose to further restrict the date range to begin in 2006. We elected to drop the remaining missing values from the dataset. The two datasets were then joined with an inner join on their common columns.
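The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not our exact script: the two small tibbles are hypothetical stand-ins for lex.csv and hapiscore_whr.csv, the column names year, life_expectancy, and happiness_score are assumed, and we assume the common columns used for the join are country and year.

```r
library(tidyverse)

# Hypothetical miniature stand-ins for lex.csv and hapiscore_whr.csv
# (wide format: one row per country, one column per year).
lex <- tibble(
  country = c("A", "B", "C"),
  `2005` = c(70.1, 65.2, 80.0),
  `2006` = c(70.4, 65.9, 80.2),
  `2007` = c(70.8, NA, 80.5)
)
hapiscore <- tibble(
  country = c("A", "B"),
  `2005` = c(NA, 45.0),
  `2006` = c(61.5, 46.2),
  `2007` = c(62.0, 47.1)
)

# Select the year columns in the chosen range, pivot to long format,
# and drop rows with missing values.
lex_clean <- lex %>%
  select(country, `2006`:`2007`) %>%
  pivot_longer(-country, names_to = "year", values_to = "life_expectancy") %>%
  drop_na(life_expectancy)

happiness_clean <- hapiscore %>%
  select(country, `2006`:`2007`) %>%
  pivot_longer(-country, names_to = "year", values_to = "happiness_score") %>%
  drop_na(happiness_score)

# Inner join keeps only country-year pairs present in both tables
# (assumed join keys: country and year).
joined <- inner_join(lex_clean, happiness_clean, by = c("country", "year"))
```

An inner join means that a country-year pair dropped from either dataset for a missing value will not appear in the joined data at all.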
Data Visualization
Relationship Between Quantitative Variables
This visualization shows the relationship between the “Happiness Score” and “Life Expectancy” variables as a scatter plot. Each data point is drawn as a circle, with happiness score on the x-axis and life expectancy on the y-axis. The points are colored medium orchid with an alpha value of 0.4, which makes them slightly transparent. The plot exhibits a positive linear trend: as the happiness score increases, life expectancy tends to increase as well.
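A plot like the one described could be produced with ggplot2 along these lines. This is only a sketch: the data frame below is a made-up placeholder rather than our joined data, and the column names happiness_score and life_expectancy and the axis labels are assumptions.

```r
library(ggplot2)

# Hypothetical placeholder data; the real plot would use the joined dataset.
joined <- data.frame(
  happiness_score = c(45.0, 52.3, 61.5, 70.2, 75.8),
  life_expectancy = c(62.1, 66.4, 71.0, 78.3, 81.2)
)

# Scatter plot with semi-transparent medium orchid points.
p <- ggplot(joined, aes(x = happiness_score, y = life_expectancy)) +
  geom_point(color = "mediumorchid", alpha = 0.4) +
  labs(x = "Happiness Score", y = "Life Expectancy")

print(p)
```

The alpha value makes overlapping points appear darker, which helps show where observations cluster when many points are plotted.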